For advertising effectiveness and economic forecasting.
Acquired the Revolution R company and uses R for a variety of purposes.
For behavior analysis related to status updates and profile pictures.
For data visualization and semantic clustering.
For statistical analysis.
Scale data science.
For data curation, analysis and visualisation.
And many more…
R can import data from many sources. It supports file formats such as csv, xlsx, SPSS, and SAS, as well as databases such as MySQL, SQLite, PostgreSQL, and MonetDB.
The most common methods are reading data from a csv, xlsx, or txt file, or connecting to a MySQL or SQLite database.
Used for reading rectangular data such as “csv”, “tsv”, and “fwf” files into R
Used to import Excel files into R
R interface to Apache Spark to work with big data
Manage Google Drive files from R.
Interact with Google Sheets from R.
This package wraps the ‘xml2’ and ‘httr’ packages to make it easy to download and manipulate HTML and XML
We can read a .csv file using the base read.csv() function or the read_csv() function from the readr package
data <- read.csv("datasets/adult_data.csv")
names(data) <- c("age", "workclass", "fnlwgt", "education", "education_num", "marital_status", "occupation", "relationship", "race", "gender", "capital_gain", "capital_loss", "hours_per_week", "native_country", "predictive_variable")
head(data)
## age workclass fnlwgt education education_num
## 1 50 Self-emp-not-inc 83311 Bachelors 13
## 2 38 Private 215646 HS-grad 9
## 3 53 Private 234721 11th 7
## 4 28 Private 338409 Bachelors 13
## 5 37 Private 284582 Masters 14
## 6 49 Private 160187 9th 5
## marital_status occupation relationship race gender
## 1 Married-civ-spouse Exec-managerial Husband White Male
## 2 Divorced Handlers-cleaners Not-in-family White Male
## 3 Married-civ-spouse Handlers-cleaners Husband Black Male
## 4 Married-civ-spouse Prof-specialty Wife Black Female
## 5 Married-civ-spouse Exec-managerial Wife White Female
## 6 Married-spouse-absent Other-service Not-in-family Black Female
## capital_gain capital_loss hours_per_week native_country
## 1 0 0 13 United-States
## 2 0 0 40 United-States
## 3 0 0 40 United-States
## 4 0 0 40 Cuba
## 5 0 0 40 United-States
## 6 0 0 16 Jamaica
## predictive_variable
## 1 <=50K
## 2 <=50K
## 3 <=50K
## 4 <=50K
## 5 <=50K
## 6 <=50K
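When a CSV file has no header row, read.csv() with its defaults treats the first record as column names. The sketch below uses a small temporary file invented for illustration (it is not the adult dataset itself) to show how header = FALSE, col.names and strip.white = TRUE might be combined. strip.white also drops the spaces that often follow the comma separators.

```r
# Write a tiny header-less CSV to a temporary file (illustrative data only).
tmp <- tempfile(fileext = ".csv")
writeLines(c("39, State-gov, <=50K",
             "50, Self-emp-not-inc, <=50K",
             "52, Self-emp-not-inc, >50K"), tmp)

# Default call: the first record is silently used as the header row.
with_default <- read.csv(tmp)

# header = FALSE keeps every record, col.names supplies the names up front,
# and strip.white = TRUE trims spaces around unquoted character fields.
toy <- read.csv(tmp, header = FALSE,
                col.names = c("age", "workclass", "income"),
                strip.white = TRUE, stringsAsFactors = FALSE)

nrow(with_default)  # one row fewer than the file contains
nrow(toy)           # all records kept
unique(toy$income)  # no leading spaces
```

Whether the real file needs header = FALSE depends on how it was exported, so it is worth comparing the row count in R with the line count of the file.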
To obtain data from a database such as SQLite, we first need to establish a connection to the database
library(DBI)
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
Then we can use this connection object to access and edit the database
dbListTables(con)
## [1] "iris" "mtcars"
mtcarsData <- dbReadTable(con, "mtcars")
str(mtcarsData)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
dbDisconnect(con)
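An in-memory SQLite database starts out empty, so the tables listed above were presumably written in a step that is not shown. A self-contained sketch of the full round trip, assuming the DBI and RSQLite packages are installed, could look like this:

```r
library(DBI)

# Open an in-memory SQLite database and populate it with a built-in dataset.
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(con, "mtcars", mtcars)

dbListTables(con)

# Pull only the rows needed with a SQL query instead of reading the whole table.
four_cyl <- dbGetQuery(con, "SELECT mpg, cyl, wt FROM mtcars WHERE cyl = 4")
nrow(four_cyl)

# Always release the connection when finished.
dbDisconnect(con)
```

For a file-backed database, the same calls apply with dbname pointing at a file path instead of ":memory:".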
In the real world the data is not always “clean”. There are many ways to define clean. Common things to look out for:
dplyr is one of the most used packages for data wrangling in R
Also a very popular package for data wrangling
Used for string manipulation
Used to work with date data
Used to work with time data
str(data)
## 'data.frame': 32560 obs. of 15 variables:
## $ age : int 50 38 53 28 37 49 52 31 42 37 ...
## $ workclass : Factor w/ 9 levels " ?"," Federal-gov",..: 7 5 5 5 5 5 7 5 5 5 ...
## $ fnlwgt : int 83311 215646 234721 338409 284582 160187 209642 45781 159449 280464 ...
## $ education : Factor w/ 16 levels " 10th"," 11th",..: 10 12 2 10 13 7 12 13 10 16 ...
## $ education_num : int 13 9 7 13 14 5 9 14 13 10 ...
## $ marital_status : Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 3 1 3 3 3 4 3 5 3 3 ...
## $ occupation : Factor w/ 15 levels " ?"," Adm-clerical",..: 5 7 7 11 5 9 5 11 5 5 ...
## $ relationship : Factor w/ 6 levels " Husband"," Not-in-family",..: 1 2 1 6 6 2 1 2 1 1 ...
## $ race : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 3 3 5 3 5 5 5 3 ...
## $ gender : Factor w/ 2 levels " Female"," Male": 2 2 2 1 1 1 2 1 2 2 ...
## $ capital_gain : int 0 0 0 0 0 0 0 14084 5178 0 ...
## $ capital_loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hours_per_week : int 13 40 40 40 40 16 45 50 40 80 ...
## $ native_country : Factor w/ 42 levels " ?"," Cambodia",..: 40 40 40 6 40 24 40 40 40 40 ...
## $ predictive_variable: Factor w/ 2 levels " <=50K"," >50K": 1 1 1 1 1 1 2 2 2 2 ...
summary(data)
## age workclass fnlwgt
## Min. :17.00 Private :22696 Min. : 12285
## 1st Qu.:28.00 Self-emp-not-inc: 2541 1st Qu.: 117832
## Median :37.00 Local-gov : 2093 Median : 178363
## Mean :38.58 ? : 1836 Mean : 189782
## 3rd Qu.:48.00 State-gov : 1297 3rd Qu.: 237055
## Max. :90.00 Self-emp-inc : 1116 Max. :1484705
## (Other) : 981
## education education_num marital_status
## HS-grad :10501 Min. : 1.00 Divorced : 4443
## Some-college: 7291 1st Qu.: 9.00 Married-AF-spouse : 23
## Bachelors : 5354 Median :10.00 Married-civ-spouse :14976
## Masters : 1723 Mean :10.08 Married-spouse-absent: 418
## Assoc-voc : 1382 3rd Qu.:12.00 Never-married :10682
## 11th : 1175 Max. :16.00 Separated : 1025
## (Other) : 5134 Widowed : 993
## occupation relationship
## Prof-specialty :4140 Husband :13193
## Craft-repair :4099 Not-in-family : 8304
## Exec-managerial:4066 Other-relative: 981
## Adm-clerical :3769 Own-child : 5068
## Sales :3650 Unmarried : 3446
## Other-service :3295 Wife : 1568
## (Other) :9541
## race gender capital_gain
## Amer-Indian-Eskimo: 311 Female:10771 Min. : 0
## Asian-Pac-Islander: 1039 Male :21789 1st Qu.: 0
## Black : 3124 Median : 0
## Other : 271 Mean : 1078
## White :27815 3rd Qu.: 0
## Max. :99999
##
## capital_loss hours_per_week native_country
## Min. : 0.00 Min. : 1.00 United-States:29169
## 1st Qu.: 0.00 1st Qu.:40.00 Mexico : 643
## Median : 0.00 Median :40.00 ? : 583
## Mean : 87.31 Mean :40.44 Philippines : 198
## 3rd Qu.: 0.00 3rd Qu.:45.00 Germany : 137
## Max. :4356.00 Max. :99.00 Canada : 121
## (Other) : 1709
## predictive_variable
## <=50K:24719
## >50K : 7841
##
##
##
##
##
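The summary lists "?" as a category in workclass and native_country, which looks like a placeholder for missing values. A minimal sketch of how such markers might be recoded as NA, using a toy vector that stands in for one column (the values are invented for illustration):

```r
# Toy column mimicking the raw values, including the leading space.
workclass <- c(" Private", " ?", " State-gov", " ?", " Private")

# Trim surrounding whitespace, then recode the "?" placeholder as NA.
workclass_clean <- trimws(workclass)
workclass_clean[workclass_clean == "?"] <- NA

# Count missing entries and tabulate the rest.
n_missing <- sum(is.na(workclass_clean))
n_missing
table(workclass_clean, useNA = "ifany")
```

Once the placeholder is a real NA, functions such as summary() and is.na() report it as missing rather than as one more category.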
Age
library(ggplot2)
ggplot(data, aes(x = age)) + geom_bar()
Hours worked per week
ggplot(data, aes(x = hours_per_week)) + geom_histogram(binwidth = 10)
Marital Status
library(dplyr)
library(plotly)
data_marital_status <- data %>% group_by(marital_status) %>% summarise(count = n())
ggplotly(ggplot(data_marital_status, aes(x = reorder(marital_status, count), y = count)) + geom_col() + coord_flip())
ggplot(data_marital_status, aes(x = "", y = count, fill = reorder(marital_status, -count))) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0)
Education
ggplotly(ggplot(data %>% group_by(education) %>% summarise(count = n()), aes(x = reorder(education, count), y = count)) + geom_col() + coord_flip())
Occupation
ggplotly(ggplot(data %>% group_by(occupation) %>% summarise(count = n()), aes(x = reorder(occupation, count), y = count)) + geom_col() + coord_flip())
Relationship
ggplotly(ggplot(data %>% group_by(relationship) %>% summarise(count = n()), aes(x = reorder(relationship, count), y = count)) + geom_col() + coord_flip())
Race
ggplotly(ggplot(data %>% group_by(race) %>% summarise(count = n()), aes(x = reorder(race, count), y = count)) + geom_col() + coord_flip())
Gender
ggplotly(ggplot(data %>% group_by(gender) %>% summarise(count = n()), aes(x = reorder(gender, count), y = count)) + geom_col() + coord_flip())
Native Country
ggplotly(ggplot(data %>% group_by(native_country) %>% summarise(count = n()), aes(x = reorder(native_country, count), y = count)) + geom_col() + coord_flip())
People who study more make more money
data$is_rich <- if_else(data$predictive_variable == "<=50K", 0, 1)
education_summary <- data %>% group_by(education) %>% summarise(number_of_rich_people = sum(is_rich), number_of_poor_people = n() - number_of_rich_people, total_people = n())
education_summary
## # A tibble: 16 x 4
## education number_of_rich_people number_of_poor_people total_people
## <fct> <dbl> <dbl> <int>
## 1 " 10th" 933 0 933
## 2 " 11th" 1175 0 1175
## 3 " 12th" 433 0 433
## 4 " 1st-4th" 168 0 168
## 5 " 5th-6th" 333 0 333
## 6 " 7th-8th" 646 0 646
## 7 " 9th" 514 0 514
## 8 " Assoc-acdm" 1067 0 1067
## 9 " Assoc-voc" 1382 0 1382
## 10 " Bachelors" 5354 0 5354
## 11 " Doctorate" 413 0 413
## 12 " HS-grad" 10501 0 10501
## 13 " Masters" 1723 0 1723
## 14 " Preschool" 51 0 51
## 15 " Prof-school" 576 0 576
## 16 " Some-college" 7291 0 7291
print(unique(data$predictive_variable))
## [1] <=50K >50K
## Levels: <=50K >50K
print(unique(as.character(data$predictive_variable)))
## [1] " <=50K" " >50K"
salary <- unique(as.character(data$predictive_variable))
library(stringr)
str_sub(salary, 2, str_length(salary))
## [1] "<=50K" ">50K"
gsub(" ", "", salary)
## [1] "<=50K" ">50K"
trimws(salary)
## [1] "<=50K" ">50K"
salary
## [1] " <=50K" " >50K"
library(microbenchmark)
microbenchmark(str_sub(salary, 2, str_length(salary)), gsub(" ", "", salary), trimws(salary))
## Unit: microseconds
##                                    expr   min    lq    mean median     uq   max neval cld
##  str_sub(salary, 2, str_length(salary))   2.9   3.4   4.795    4.2   4.60  28.1   100   a
##                   gsub(" ", "", salary)   4.7   5.4   7.537    6.2   6.95  70.0   100   a
##                          trimws(salary) 147.0 148.3 164.961  149.4 150.90 326.5   100   b
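The three approaches benchmarked above give the same result here only because every value has exactly one leading space and no other spaces. The sketch below uses invented strings to show where they diverge. Base substr() stands in for str_sub() so that the example needs no extra packages.

```r
x <- c(" <=50K", " Never married ", "HS-grad")

# Drop the first character regardless of what it is
# (base equivalent of str_sub(x, 2, str_length(x))).
drop_first <- substr(x, 2, nchar(x))

# Remove every space, including spaces inside the value.
no_spaces <- gsub(" ", "", x)

# Remove only leading and trailing whitespace.
trimmed <- trimws(x)

drop_first  # "<=50K" "Never married " "S-grad"
no_spaces   # "<=50K" "Nevermarried" "HS-grad"
trimmed     # "<=50K" "Never married" "HS-grad"
```

Speed is therefore only one consideration: trimws() is the slowest of the three in the benchmark, but it is the only one that is safe when values may contain inner spaces or no leading space at all.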
salary <- str_sub(salary, 2, str_length(salary))
salary
## [1] "<=50K" ">50K"
data$predictive_variable <- str_sub(data$predictive_variable, 2, str_length(data$predictive_variable))
data$is_rich <- if_else(data$predictive_variable == "<=50K", 0, 1)
education_summary <- data %>% group_by(education) %>% summarise(number_of_rich_people = sum(is_rich), number_of_poor_people = n() - number_of_rich_people, total_people = n())
education_summary
## # A tibble: 16 x 4
## education number_of_rich_people number_of_poor_people total_people
## <fct> <dbl> <dbl> <int>
## 1 " 10th" 62 871 933
## 2 " 11th" 60 1115 1175
## 3 " 12th" 33 400 433
## 4 " 1st-4th" 6 162 168
## 5 " 5th-6th" 16 317 333
## 6 " 7th-8th" 40 606 646
## 7 " 9th" 27 487 514
## 8 " Assoc-acdm" 265 802 1067
## 9 " Assoc-voc" 361 1021 1382
## 10 " Bachelors" 2221 3133 5354
## 11 " Doctorate" 306 107 413
## 12 " HS-grad" 1675 8826 10501
## 13 " Masters" 959 764 1723
## 14 " Preschool" 0 51 51
## 15 " Prof-school" 423 153 576
## 16 " Some-college" 1387 5904 7291
education_data <- distinct(data %>% select(education, education_num)) %>% arrange(education_num)
education_data
## education education_num
## 1 Preschool 1
## 2 1st-4th 2
## 3 5th-6th 3
## 4 7th-8th 4
## 5 9th 5
## 6 10th 6
## 7 11th 7
## 8 12th 8
## 9 HS-grad 9
## 10 Some-college 10
## 11 Assoc-voc 11
## 12 Assoc-acdm 12
## 13 Bachelors 13
## 14 Masters 14
## 15 Prof-school 15
## 16 Doctorate 16
data$is_rich <- if_else(data$predictive_variable == "<=50K", 0, 1)
education_summary <- data %>%
  group_by(education, education_num) %>%
  summarise(percentage_of_rich_people = sum(is_rich) / n() * 100) %>%
  arrange(education_num)
education_summary
## # A tibble: 16 x 3
## # Groups: education [16]
## education education_num percentage_of_rich_people
## <fct> <int> <dbl>
## 1 " Preschool" 1 0
## 2 " 1st-4th" 2 3.57
## 3 " 5th-6th" 3 4.80
## 4 " 7th-8th" 4 6.19
## 5 " 9th" 5 5.25
## 6 " 10th" 6 6.65
## 7 " 11th" 7 5.11
## 8 " 12th" 8 7.62
## 9 " HS-grad" 9 16.0
## 10 " Some-college" 10 19.0
## 11 " Assoc-voc" 11 26.1
## 12 " Assoc-acdm" 12 24.8
## 13 " Bachelors" 13 41.5
## 14 " Masters" 14 55.7
## 15 " Prof-school" 15 73.4
## 16 " Doctorate" 16 74.1
ggplotly(ggplot(education_summary, aes(x = reorder(education, education_num), y = percentage_of_rich_people)) + geom_bar(stat = "identity") + coord_flip())
ggplotly(ggplot(education_summary, aes(x = education_num, y = percentage_of_rich_people, color = education)) + geom_point())
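Because is_rich is a 0/1 indicator, sum(is_rich) / n() * 100 is simply the group mean multiplied by 100. A small base R sketch on invented data illustrates the same calculation with tapply():

```r
# Toy data: a 0/1 indicator per person and the group each person belongs to.
toy <- data.frame(
  education = c("Bachelors", "Bachelors", "HS-grad", "HS-grad", "HS-grad", "Masters"),
  is_rich   = c(1, 0, 0, 0, 1, 1)
)

# The mean of a 0/1 vector is the share of ones, so mean * 100 is the percentage.
pct_rich <- tapply(toy$is_rich, toy$education, mean) * 100
pct_rich
```

In the dplyr pipeline, the equivalent expression would be mean(is_rich) * 100 inside summarise().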
Let’s analyze which types of occupation make more money
occupation_summary <- data %>%
  group_by(occupation) %>%
  summarise(percentage_of_rich_people = sum(is_rich) / n() * 100) %>%
  arrange(percentage_of_rich_people)
occupation_summary
## # A tibble: 15 x 2
## occupation percentage_of_rich_people
## <fct> <dbl>
## 1 " Priv-house-serv" 0.671
## 2 " Other-service" 4.16
## 3 " Handlers-cleaners" 6.28
## 4 " ?" 10.4
## 5 " Armed-Forces" 11.1
## 6 " Farming-fishing" 11.6
## 7 " Machine-op-inspct" 12.5
## 8 " Adm-clerical" 13.5
## 9 " Transport-moving" 20.0
## 10 " Craft-repair" 22.7
## 11 " Sales" 26.9
## 12 " Tech-support" 30.5
## 13 " Protective-serv" 32.5
## 14 " Prof-specialty" 44.9
## 15 " Exec-managerial" 48.4
ggplotly(ggplot(occupation_summary, aes(x = reorder(occupation, percentage_of_rich_people), y = percentage_of_rich_people)) + geom_bar(stat = "identity") + coord_flip())
People who work more make more money
ggplot(data, aes(x = hours_per_week, fill = predictive_variable)) + geom_histogram(binwidth = 10)
Men make more money than Women?
ggplot(data, aes(x = gender, fill = predictive_variable)) + geom_bar(stat = "count")
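The summary shows that the two gender groups differ in size (10771 versus 21789 rows), so raw counts are hard to compare directly. One way to check the hypothesis is to look at the share within each group. A base R sketch on invented data:

```r
# Toy data standing in for the gender and income columns.
gender <- c("Female", "Female", "Female", "Male", "Male", "Male", "Male", "Male")
income <- c("<=50K", "<=50K", ">50K", "<=50K", ">50K", ">50K", "<=50K", ">50K")

# Cross-tabulate, then convert each row to proportions (margin = 1 means by row).
counts <- table(gender, income)
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

In ggplot2, geom_bar(position = "fill") draws the same within-group shares instead of stacked counts.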
There are many ways to predict a variable, depending on the data type.
If the variable you need to predict is a number, use regression
If the variable you need to predict is categorical, use classification
Some famous regression algorithms are:
Some famous classification algorithms are:
All algorithms generate a model (a formula). Creating the model is often referred to as training
You can store the models in a variable and use them later on to predict.
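As a minimal illustration of both cases, the sketch below trains one regression model and one classification model on the built-in mtcars dataset, stores them in variables, and reuses them to predict for new rows. The choice of linear and logistic regression, and the new_cars values, are assumptions made for the example.

```r
# Regression: the target (mpg) is numeric.
mpg_model <- lm(mpg ~ wt, data = mtcars)

# Classification: the target (am, 0 = automatic, 1 = manual) is categorical.
am_model <- glm(am ~ wt, data = mtcars, family = binomial)

# The stored models can be reused later on data they have not seen.
new_cars <- data.frame(wt = c(2.5, 3.5))
predicted_mpg <- predict(mpg_model, new_cars)
predicted_prob_manual <- predict(am_model, new_cars, type = "response")

predicted_mpg
predicted_prob_manual
```

predict() dispatches on the model class, so the same call pattern works for most modelling packages in R.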
# library(e1071)
# naive_bayes_model <- naiveBayes(predictive_variable ~ ., data = data)
# predicted_values_from_naive_bayes_model <- predict(naive_bayes_model, data)
# tab <- table(predicted_values_from_naive_bayes_model,data$predictive_variable)
# print(tab)
# 1 - sum(diag(tab)) / sum(tab)
# library(party)
# decision_tree_model <- ctree(predictive_variable ~ ., data = data)
# predicted_values_from_decision_tree_model <- predict(decision_tree_model, data)
# tab <- table(predicted_values_from_decision_tree_model,data$predictive_variable)
# print(tab)
# 1 - sum(diag(tab)) / sum(tab)
# library(randomForest)
# random_forest_model <- randomForest(predictive_variable ~ ., data = data)
# plot(random_forest_model)
# predicted_values_from_random_forest_model <- predict(random_forest_model, data)
# tab <- table(predicted_values_from_random_forest_model,data$predictive_variable)
# print(tab)
# 1 - sum(diag(tab)) / sum(tab)
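The snippets above compute the error rate on the same rows that were used for training, which tends to understate the error on new data. A hedged sketch of a hold-out evaluation follows. It uses base R logistic regression on a two-class subset of iris as a stand-in for the models above, and the 70/30 split and the 0.5 cut-off are arbitrary choices for illustration.

```r
set.seed(42)  # make the random split reproducible

# Two-class subset of iris; Sepal.Length alone overlaps between the classes.
two_class <- droplevels(subset(iris, Species != "setosa"))

# Hold out 30% of the rows for testing.
test_idx <- sample(nrow(two_class), size = 0.3 * nrow(two_class))
train <- two_class[-test_idx, ]
test  <- two_class[test_idx, ]

model <- glm(Species ~ Sepal.Length, data = train, family = binomial)

# Predicted probabilities refer to the second factor level ("virginica").
prob <- predict(model, test, type = "response")
predicted <- factor(ifelse(prob > 0.5, "virginica", "versicolor"),
                    levels = levels(test$Species))

# Confusion table and misclassification rate on rows the model has not seen.
tab <- table(predicted, actual = test$Species)
error_rate <- 1 - sum(diag(tab)) / sum(tab)
tab
error_rate
```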
R Markdown allows you to create reproducible reports that can be shared in HTML, PDF, PowerPoint, or Word format.
Shiny allows you to create interactive web applications.
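A Shiny app pairs a user interface definition with a server function. The sketch below, assuming the shiny package is installed, builds a small app around the built-in iris data. The input and output ids ("bins", "hist") are names chosen for this example.

```r
library(shiny)

# User interface: one slider and one plot placeholder.
ui <- fluidPage(
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 30),
  plotOutput("hist")
)

# Server logic: redraw the histogram whenever the slider changes.
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(iris$Sepal.Length, breaks = input$bins,
         main = "Sepal length", xlab = "Sepal.Length")
  })
}

# Build the app object; printing it or calling runApp(app)
# in an interactive session launches it.
app <- shinyApp(ui, server)
```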
newData <- data.frame(
  Sepal.Length = c(5.7, 6.3, 7.2),
  Sepal.Width = c(4.4, 2.9, 3.1),
  Petal.Length = c(1.4, 4.0, 5.1),
  Petal.Width = c(0.2, 1.0, 2.3),
  Species = ""
)
newData
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.7 4.4 1.4 0.2
## 2 6.3 2.9 4.0 1.0
## 3 7.2 3.1 5.1 2.3
ggplot(iris, aes(x = Sepal.Length)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplotly(ggplot(iris %>% group_by(Species) %>% summarise(count = n()), aes(x = reorder(Species, count), y = count)) + geom_col() + coord_flip())
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + geom_point()
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) + geom_point()
ggplot(iris, aes(x = Sepal.Length, y = Petal.Width, color = Species)) + geom_point()
ggplot(iris, aes(x = Sepal.Width, y = Petal.Length, color = Species)) + geom_point()
ggplot(iris, aes(x = Sepal.Width, y = Petal.Width, color = Species)) + geom_point()
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) + geom_point()
actual_species <- c("setosa", "versicolor", "virginica")
newData
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.7 4.4 1.4 0.2
## 2 6.3 2.9 4.0 1.0
## 3 7.2 3.1 5.1 2.3
# library(e1071)
# naive_bayes_model <- naiveBayes(Species ~ ., data = iris)
# predicted_values_from_naive_bayes_model <- predict(naive_bayes_model, iris)
# tab <- table(predicted_values_from_naive_bayes_model,iris$Species)
# print(tab)
# 1 - sum(diag(tab)) / sum(tab)
# # Predicting the species for newData
# answer <- predict(naive_bayes_model, newData)
# tab <- table(answer,actual_species)
# print(tab)
# 1 - sum(diag(tab)) / sum(tab)